
[Cosmos] Hybrid Search query pipeline #38275

Open · wants to merge 34 commits into base: main

Conversation

@simorenoh (Member) commented Nov 1, 2024:

Adds support for performing full text search queries through the introduction of the hybrid search query pipeline. This consists of the newly added hybrid_search_aggregator, which performs the necessary query steps to obtain the needed results.

With these changes, the SDK can now interpret queries utilizing key functions like FullTextContains(), FullTextContainsAll(), FullTextContainsAny(), Order By Rank <FullTextFunction>(), and Order By Rank RRF().

The design doc for the implementation can be found here: Hybrid Search Doc.
The new README in this PR also has additional information.
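For illustration only, a minimal sketch of how such queries might be issued through the SDK once this pipeline is in place. The account endpoint, key, container, and property names are hypothetical, and the exact full text function signatures (e.g. the argument shape of FullTextScore) may differ by service version:

```python
from azure.cosmos import CosmosClient

# Hypothetical account/container names, for illustration only.
client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("sample_db").get_container_client("fts_container")

# Plain full text filter.
contains_query = "SELECT TOP 10 c.id FROM c WHERE FullTextContains(c.text, 'quantum')"

# Hybrid search: re-rank results with Reciprocal Rank Fusion over two full text scores.
rrf_query = (
    "SELECT TOP 10 c.id FROM c "
    "ORDER BY RANK RRF(FullTextScore(c.text, ['quantum']), FullTextScore(c.title, ['computing']))"
)

for query in (contains_query, rrf_query):
    for item in container.query_items(query=query, enable_cross_partition_query=True):
        print(item["id"])
```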

Still missing in this PR at the moment:

  • Samples for both sync and async
  • Tests for both sync and async
  • Additional README information on how to run these queries.
  • Cleaning up/aggregating some of the repeated logic.
  • Possible optimizations to query plan fetching - will probably be addressed in a separate PR. Issue: [Cosmos] make queries fetch query plan in every query #38577

@github-actions github-actions bot added the Cosmos label Nov 1, 2024
@azure-sdk (Collaborator)

API change check

API changes are not detected in this pull request.

@simorenoh simorenoh marked this pull request as ready for review November 5, 2024 21:25
@simorenoh simorenoh requested review from annatisch and a team as code owners November 5, 2024 21:25
sdk/cosmos/azure-cosmos/README.md (two outdated review threads, resolved)
@@ -173,7 +173,7 @@ def __init__(self, execution_context, aggregate_operators):
         for operator in aggregate_operators:
             if operator == "Average":
                 self._local_aggregators.append(_AverageAggregator())
-            elif operator == "Count":
+            elif operator in ("Count", "CountIf"):
Review comment (Member):
There are no changes to _CountAggregator in this PR - how is this supposed to start supporting CountIf?
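For context, a hedged sketch of what a count-style aggregator in this pipeline typically looks like (not the SDK's actual implementation): it only sums numeric partial results, so reusing it for CountIf would rely on the backend already returning the conditional count as a plain number per partition.

```python
class _CountAggregatorSketch:
    """Illustrative count-style aggregator: sums numeric partial counts.

    Mapping CountIf onto the same class assumes each partition's partial
    result for CountIf is already a plain number; if the payload shape
    differs, a dedicated aggregator would be required.
    """

    def __init__(self):
        self.count = 0

    def aggregate(self, other):
        # 'other' is the partial count produced by one partition's query.
        self.count += other

    def get_result(self):
        return self.count
```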

@@ -187,7 +187,7 @@ def __init__(self, execution_context, aggregate_operators):
         for operator in aggregate_operators:
             if operator == "Average":
                 self._local_aggregators.append(_AverageAggregator())
-            elif operator == "Count":
+            elif operator in ("Count", "CountIf"):
Review comment (Member):
No changes to CountAggregator - how would this support the new aggregate?

@@ -136,6 +145,22 @@ def _create_pipelined_execution_context(self, query_execution_info):
                 self._query,
                 self._options,
                 query_execution_info)
+        elif query_execution_info.has_hybrid_search_query_info():
Review comment (Member):
Can the exception factory be refactored so it can be shared across sync and async?
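If the comment refers to the error construction inside the new hybrid search branch, one possible shape (hypothetical module and function names, not the SDK's actual layout) is a plain synchronous helper that both the sync and async pipelined execution contexts import, so the message and status code are defined once:

```python
# _hybrid_search_errors.py (hypothetical shared module; it contains no awaits,
# so it can be imported from both the sync and async execution contexts)
from azure.cosmos.exceptions import CosmosHttpResponseError


def unsupported_hybrid_search_query(reason: str) -> CosmosHttpResponseError:
    """Builds the exception raised when a hybrid search query cannot be executed."""
    return CosmosHttpResponseError(
        status_code=400,
        message=f"Hybrid search query is not supported: {reason}",
    )
```

Both `_create_pipelined_execution_context` implementations could then raise the helper's result instead of duplicating the construction.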

        self.test_db = self.client.create_database(str(uuid.uuid4()))
        self.test_container = self.test_db.create_container(
            id="FTS" + self.TEST_CONTAINER_ID,
            partition_key=PartitionKey(path="/id"),
@FabianMeiswinkel (Member) commented Nov 15, 2024:
Same comment I left for Java

I think the structure of the docs is not very helpful - you would need test cases that target individual physical partitions (both via PK in the query request options and via the query plan through a WHERE condition) as well as real cross-partition queries. Technically you could still do it with id, but the additional full-text / hybrid search becomes useless on a single doc. So I would argue there should be a few logical partitions with 100 docs each (could be the same ids from the file), and then tests that search across partitions as well as scoped to an individual partition - ideally also with HPK. You kind of have to add all these test cases to the matrix because we can't rely on it magically working, given the extra step of getting global statistics.

I think the test matrix will need to be extended quite a bit - maybe let's jump on a call Monday morning to discuss this, and also which tests are required for merge vs. which are OK to add later.
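A minimal sketch of the kind of test layout this suggests: a few logical partitions with ~100 docs each, then the same hybrid query run once cross-partition and once scoped to a single partition. Container, property, and query contents are illustrative, the full text indexing/policy configuration the container would actually need is omitted, and a hierarchical-partition-key (HPK) variant would be an additional case on top of this.

```python
import uuid
from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical account credentials, for illustration only.
client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database(str(uuid.uuid4()))

# Dedicated partition key path instead of /id, so each logical partition holds many docs.
# (Full text indexing policy omitted here; the real test container would need it.)
container = db.create_container(id="fts_multi_pk", partition_key=PartitionKey(path="/pk"))

for pk in ("pk-0", "pk-1", "pk-2"):
    for i in range(100):
        container.create_item({"id": f"{pk}-{i}", "pk": pk, "text": f"sample text {i}"})

query = "SELECT TOP 10 c.id FROM c ORDER BY RANK FullTextScore(c.text, ['sample'])"

# Cross-partition run: exercises the global statistics step across physical partitions.
cross_partition = list(container.query_items(query=query, enable_cross_partition_query=True))

# Run scoped to a single logical partition via the request options.
single_partition = list(container.query_items(query=query, partition_key="pk-1"))
```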

@FabianMeiswinkel (Member) left a review comment:
Looks good overall - main comment is about the test matrix/coverage.
